import io
import os
import cv2
import ast
import time
import imagehash
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from tqdm import tqdm
from PIL import Image
from utils.visualization import display_images
from google.cloud import vision
from google.oauth2 import service_account
The goals of preprocessing are to delete possible duplicates and to generate weak labels based on the data itself.
I will proceed from the hypothesis that a logo is something popular, so it can appear in the data more than once. Of course, new logos could be difficult to detect with this approach, but as the very first step I should check it. This task is strongly connected to duplicate elimination, the main preprocessing step before training models.
Duplicates with different file names could lead to data leakage, ambiguous labels, and overfitting in edge cases.
I should also try to generate weak labels from the data to train a classifier. I don't think this problem can be fully solved via unsupervised methods, because discriminating between real and fake logos is a subjective process and requires at least approximate labels. The most prominent approach here is semi-supervised learning (SSL), where we have a lot of unlabeled data and a small percentage of real or approximate labels. These requirements allow us to use typical deep learning classification models with a modified semi-supervised pipeline.
1) Find duplicates using perceptual hash similarity
2) Filter duplicates, compute the number of occurrences
3) Use a number of occurrences > 1 as the condition for a "true" label
4) Use the Google API Logo Detection to label 1000 images
The resulting model should have a low number of false positives, which means Precision = $\frac{TP}{TP+FP} \rightarrow \max$. However, we don't have any true labels, only the weak labels we can extract from the data, so high precision over weak labels cannot guarantee that the model gives correct results.
That is why I labeled 2014 images from the dataset and will use them for validation of the semi-supervised pipeline (I don't use them for training).
I didn't use any other logo datasets, because this one contains a lot of misdetections and a lot of different logos. Typical datasets like WebLogo-2M cover only 194 logos, which I consider very little compared to this dataset (a conclusion I drew after visualizing the data). So my decision to label part of the data for validation will simplify the final testing of the different methods.
DATASET_FOLDER = "test_task_logo_dataset"
# collect all jpg images and build a dataframe with their paths
files = next(os.walk(DATASET_FOLDER))[2]
files = list(filter(lambda x: x.endswith("jpg"), files))
files = list(map(lambda x: os.path.join(DATASET_FOLDER, x), files))
files = np.array(files)
df = pd.DataFrame([])
df["path"] = files
# compute perceptual hash of images
df["hash"] = df["path"].apply(lambda x: str(imagehash.phash(Image.open(x))))
display_images(df, 5, 5)
Suppose a model returns a score of $1.0$ for every image: $f(x) = 1$. By visualizing random images, we can estimate the approximate precision of such a model and set it as a benchmark to compare against future classifiers. I counted the false positives myself over a few runs of image display and call the result the visual precision.
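As a rough illustration, here is a minimal sketch of that estimate; the per-run false positive counts below are hypothetical placeholders, not my actual tallies:
# hypothetical counts: each run of display_images shows 25 random images,
# and a "false positive" is an image I judge to contain no real logo
false_positives_per_run = [13, 16, 14]  # assumed values, 25 images per run
images_per_run = 25
fp = sum(false_positives_per_run)
tp = images_per_run * len(false_positives_per_run) - fp
visual_precision = tp / (tp + fp)
print(f"visual precision of f(x) = 1: {visual_precision:.1%}")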
How can we quickly compute the similarity between images across the entire dataset? I decided to use a perceptual hash for this task. It's very fast, doesn't use any pretrained model, and can find approximate duplicates.
A perceptual hash maps an input image to an 8x8 boolean descriptor. If two images are similar, their boolean descriptors will be similar too, and hence have a low or zero element-wise distance: choose a threshold and take the image pairs whose distance falls below it.
$A,\ B \in {\{0,\ 1\}}^{8\times8}$ with distance $d(A,\ B)=\sum_{ij} |a_{ij} - b_{ij}| \in \{0,...,64\}$
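The imagehash library exposes exactly this distance via hash subtraction; a quick illustration, with hypothetical file paths:
# hypothetical paths, for illustration only
h1 = imagehash.phash(Image.open("img_a.jpg"))  # 8x8 boolean descriptor
h2 = imagehash.phash(Image.open("img_b.jpg"))
print(h1 - h2)  # Hamming distance d in {0, ..., 64}; near-duplicates give values close to 0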
Via the linear transformation $2x - 1$ applied to the matrices (as in the code below), elements are mapped from $\{0,\ 1\}$ to $\{-1,\ 1\}$, and we can construct a functional in Einstein summation notation to compute the similarities. In this case, similar images have a similarity score $> T$.
$A,\ B \in {\{-1,\ 1\}}^{8\times8}$ with similarity $s(A,\ B)=\sum_{ij} a_{ij}b_{ij} \in \{-64, ..., 64\}$
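The two quantities are linearly related, $s(A,\ B) = 64 - 2\,d(A,\ B)$, so the similarity threshold $T = 50$ used below corresponds to a Hamming distance of at most 6. A small sanity check on random hashes:
rng = np.random.default_rng(0)
A = rng.integers(0, 2, size=(8, 8))
B = rng.integers(0, 2, size=(8, 8))
d = np.abs(A - B).sum()                # distance over {0, 1} matrices
s = ((2 * A - 1) * (2 * B - 1)).sum()  # similarity over {-1, 1} matrices
assert s == 64 - 2 * d                 # s > 50 is equivalent to d <= 6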
def compute_groups(hashes, threshold=50):
"""
Finds similar images using perceptual hash
similarity as matching criteria
Parameters
----------
hashes : pd.Series
Pandas series of hashes
threshold : int
Threshold determines match between
two 8x8 boolean perceptual hashes
Returns
-------
groups : list
List of lists with indexes of similar
images, e.g [[0], [1, 35], ...]
"""
    # unpack hex hashes into {-1, 1} int8 arrays - (N, 8, 8) shape
    hashes = hashes.apply(lambda x: 2 * imagehash.hex_to_hash(x).hash.astype(np.int8) - 1)
    hashes = np.array(hashes.tolist())
    # compute all pairwise similarities via Einstein summation - (N, N) matrix
    matrix = np.einsum("kij,mij->km", hashes, hashes)
    # for each image, collect the indexes of all images above the similarity threshold
    groups = [list(np.where(x > threshold)[0]) for x in matrix]
    return groups
groups = compute_groups(df.hash)
df["group"] = -1
with tqdm(ascii=True, leave=False, total=len(df)) as bar:
for i, group in enumerate(groups):
        # if any member of the group is still unassigned (-1),
        # assign the current group id to all of its members
        if (df["group"].loc[group] == -1).any():
            df.loc[group, "group"] = i
bar.update()
df.to_csv("datasets/dataset_full.csv", index=False)
# compute number of elements in each group
counts = df.groupby("group").count().reset_index()[["group", "path"]]
counts = counts.rename(columns={"path": "n_images"})
# get only one image from each group
clean = df.groupby("group").apply(lambda g: g["path"].iloc[0]).reset_index()
clean = clean.rename(columns={0: "path"})
# merge based on group id
clean = clean.merge(counts, how="inner", on="group")
clean.to_csv("datasets/dataset.csv", index=False)
display_images(clean[clean.n_images > 1], 5, 5)
The results look quite noisy, but this will be okay for the first step.
display_images(df.iloc[:-200], 7, 7)
I tried the Google Logo Detection API (Cloud Vision) to collect additional, more robust labels. Its free tier allows 1000 requests per month, so I used that quota.
df = pd.read_csv("datasets/dataset.csv")
credentials = service_account.Credentials.from_service_account_file('/Users/vaden4d/Downloads/creds.json')
client = vision.ImageAnnotatorClient(credentials=credentials)
with tqdm(ascii=True, leave=False, total=1000) as bar:
for index, row in df.iloc[-1000:].iterrows():
path = row.path
with io.open(path, 'rb') as image_file:
content = image_file.read()
image = vision.Image(content=content)
response = client.logo_detection(image=image)
        try:
            df.loc[index, "api_score"] = response.logo_annotations[0].score
            df.loc[index, "api_name"] = response.logo_annotations[0].description
        except Exception:
            # empty logo_annotations (no logo detected) or an API error
            df.loc[index, "api_score"] = 0.0
            df.loc[index, "api_name"] = "no_logo"
time.sleep(0.2)
bar.update()
df
|   | group | path | n_images | api_score | api_name |
|---|---|---|---|---|---|
| 0 | 0 | test_task_logo_dataset/a7ee2ad0-f27a-4189-9207... | 1 | -1.000000 | -1 |
| 1 | 1 | test_task_logo_dataset/a3ff6c0b-1aa2-4fb2-be20... | 1 | -1.000000 | -1 |
| 2 | 2 | test_task_logo_dataset/e2524a86-a652-4181-90c4... | 1 | -1.000000 | -1 |
| 3 | 3 | test_task_logo_dataset/f0b3e6ef-f025-471c-942e... | 2 | -1.000000 | -1 |
| 4 | 4 | test_task_logo_dataset/42b000e9-4346-486b-b3d6... | 1 | -1.000000 | -1 |
| ... | ... | ... | ... | ... | ... |
| 32153 | 36240 | test_task_logo_dataset/4809662e-1d0e-4a25-a721... | 1 | 0.586603 | Polish Press Agency |
| 32154 | 36241 | test_task_logo_dataset/64108cf4-9d72-4003-a8f9... | 1 | 0.000000 | no_logo |
| 32155 | 36242 | test_task_logo_dataset/e5110399-79a9-4864-8227... | 1 | 0.000000 | no_logo |
| 32156 | 36243 | test_task_logo_dataset/2c955ab4-89ad-41fd-8422... | 1 | 0.936305 | Amazon |
| 32157 | 36244 | test_task_logo_dataset/fd017c12-c8a4-444f-a2bf... | 1 | 0.000000 | no_logo |
32158 rows × 5 columns
df.to_csv("datasets/dataset_with_labels.csv", index=False)
display_images(df[df.api_score > -1], 5, 5, title_column="api_name")
By visually inspecting random samples from the 1000 labeled images, I estimated the approximate precision of the Google API: $Precision = 39.6\%$. It's very low, so I decided not to use its positive predictions. However, I observed that when the API predicts no logo, the result looks reliable enough, so I use its negative predictions as weak labels for the negative class.
display_images(df[df.api_name == "no_logo"], 5, 5, title_column="api_name")
df[df.api_name == "no_logo"]
|   | group | path | n_images | api_score | api_name |
|---|---|---|---|---|---|
| 31158 | 35047 | test_task_logo_dataset/766c2009-0049-4eb3-a89a... | 1 | 0.0 | no_logo |
| 31159 | 35048 | test_task_logo_dataset/b0f9fed4-892a-4c3e-b24a... | 1 | 0.0 | no_logo |
| 31160 | 35049 | test_task_logo_dataset/8860d957-9d87-4958-ac27... | 1 | 0.0 | no_logo |
| 31161 | 35050 | test_task_logo_dataset/7c247fcb-3786-45ab-8601... | 1 | 0.0 | no_logo |
| 31163 | 35053 | test_task_logo_dataset/eb1b0c94-b894-4a45-bca2... | 1 | 0.0 | no_logo |
| ... | ... | ... | ... | ... | ... |
| 32151 | 36238 | test_task_logo_dataset/43a483b3-c867-4e4a-90c2... | 1 | 0.0 | no_logo |
| 32152 | 36239 | test_task_logo_dataset/1d29ffeb-459d-4587-8470... | 1 | 0.0 | no_logo |
| 32154 | 36241 | test_task_logo_dataset/64108cf4-9d72-4003-a8f9... | 1 | 0.0 | no_logo |
| 32155 | 36242 | test_task_logo_dataset/e5110399-79a9-4864-8227... | 1 | 0.0 | no_logo |
| 32157 | 36244 | test_task_logo_dataset/fd017c12-c8a4-444f-a2bf... | 1 | 0.0 | no_logo |
369 rows × 5 columns
df["weak_label"] = df.apply(lambda x: 0 if x.api_name == "no_logo" else 1 if x.n_images > 1 else -1, axis=1)
df.to_csv("datasets/dataset_with_weakly_labels.csv")
I decided to boost the labels based on the perceptual hash similarity and the Google API score with additional information: the image entropy and the possible text on the image. During my exploration of the dataset, I noticed a few of the most popular misdetections (see the sketch after this list):
1) Text over a single-colored background: not a logo, it contains text and probably has low entropy
2) Random crops, like parts of cars or real-world scenes: these probably have high entropy
3) Text extracted from images can give additional information
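To see why entropy separates these cases, here is a small sketch (my expectation, not measured on this dataset) comparing a flat single-colored patch with a noisy crop, using the same per-channel histogram entropy as `compute_entropy` below:
import numpy as np
from scipy.stats import entropy
flat = np.full((64, 64), 128, dtype=np.uint8)                # single-colored background
noise = np.random.randint(0, 256, (64, 64), dtype=np.uint8)  # texture-like random crop
def hist_entropy(channel):
    counts, _ = np.histogram(channel, bins=255)
    return entropy(counts)  # scipy normalizes the counts to a distribution
print(hist_entropy(flat))   # 0.0 - all mass falls into a single bin
print(hist_entropy(noise))  # close to the maximum log(255) ~ 5.54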
import cv2
import pytesseract
from pytesseract import Output
from scipy.stats import entropy
df = pd.read_csv("ocr.csv")
df.ocr = df.ocr.apply(ast.literal_eval)
def compute_entropy(img):
"""
Computes entropy for each channel of RGB image
Parameters
----------
img : np.ndarray
Image represented as (H, W, 3) array
Returns
-------
entropy(r), entropy(g), entropy(b) : float
        Entropy of each channel
    """
    # per-channel intensity histograms; scipy's entropy
    # normalizes the counts to a probability distribution
    r, _ = np.histogram(img[..., 0], bins=255)
    g, _ = np.histogram(img[..., 1], bins=255)
    b, _ = np.histogram(img[..., 2], bins=255)
    return entropy(r), entropy(g), entropy(b)
df["entropy_r"] = 0
df["entropy_g"] = 0
df["entropy_b"] = 0
df["h"] = 0
df["w"] = 0
with tqdm(ascii=True, leave=False, total=len(df)) as bar:
for index, row in df.iterrows():
        img = np.array(Image.open(row.path).convert("RGB"))  # RGB conversion handles grayscale/palette images
r, g, b = compute_entropy(img)
df.loc[index, "entropy_r"] = r
df.loc[index, "entropy_g"] = g
df.loc[index, "entropy_b"] = b
df.loc[index, "h"] = img.shape[0]
df.loc[index, "w"] = img.shape[1]
bar.update()
df["entropy"] = (df["entropy_r"] + df["entropy_g"] + df["entropy_b"]) / 3.0
df = df.sort_values(by="entropy")
After this step I executed the OCR extraction in Google Colab, so from here on the dataframe has an additional column `ocr`.
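For completeness, a minimal sketch of that Colab step, assuming the `plot_text` helper defined at the end of this notebook and an assumed min_conf of 0:
# sketch of the Colab OCR extraction (plot_text is defined at the end
# of this notebook; min_conf=0 is an assumption on my side)
ocr_results = []
with tqdm(ascii=True, leave=False, total=len(df)) as bar:
    for path in df.path:
        img = Image.open(path).convert("RGB")
        ocr_results.append(plot_text(img, min_conf=0))
        bar.update()
df["ocr"] = ocr_results
df.to_csv("ocr.csv", index=False)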
df[df.entropy > df.entropy.quantile(q=0.99)]
|   | Unnamed: 0 | group | path | n_images | api_score | api_name | weak_label | ocr | entropy_r | entropy_g | entropy_b | h | w | entropy |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 24476 | 24476 | 27147 | test_task_logo_dataset/1b9b390a-e6d7-4857-95d7... | 1 | -1.0 | -1 | -1 | [] | 5.335217 | 5.337528 | 5.311017 | 38 | 80 | 5.327921 |
| 14104 | 14104 | 15186 | test_task_logo_dataset/8dc070c7-9053-4af5-b7fe... | 1 | -1.0 | -1 | -1 | [] | 5.391945 | 5.321680 | 5.270761 | 57 | 60 | 5.328128 |
| 7744 | 7744 | 8212 | test_task_logo_dataset/bc9a611b-474f-46cc-b4e5... | 1 | -1.0 | -1 | -1 | [] | 5.265150 | 5.343976 | 5.375451 | 103 | 56 | 5.328193 |
| 4457 | 4457 | 4677 | test_task_logo_dataset/768825ee-ead2-48ae-8083... | 1 | -1.0 | -1 | -1 | [] | 5.328251 | 5.329780 | 5.326611 | 56 | 58 | 5.328214 |
| 28591 | 28591 | 32003 | test_task_logo_dataset/474b1ffd-2cc1-4644-96b9... | 1 | -1.0 | -1 | -1 | [] | 5.364165 | 5.349623 | 5.271290 | 110 | 96 | 5.328359 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 12088 | 12088 | 12960 | test_task_logo_dataset/b48faaad-48a5-479b-a38a... | 1 | -1.0 | -1 | -1 | [('', (0, 0, 0, 607))] | 5.388961 | 5.464780 | 5.526125 | 607 | 1151 | 5.459955 |
| 25945 | 25945 | 28866 | test_task_logo_dataset/22f31592-ee02-461c-b01f... | 1 | -1.0 | -1 | -1 | [] | 5.468097 | 5.468962 | 5.467702 | 58 | 122 | 5.468254 |
| 24922 | 24922 | 27669 | test_task_logo_dataset/2deca010-08b8-4f9d-bddf... | 1 | -1.0 | -1 | -1 | [] | 5.446546 | 5.473996 | 5.490587 | 80 | 90 | 5.470376 |
| 19858 | 19858 | 21724 | test_task_logo_dataset/8d7fa911-3b16-419e-b82e... | 1 | -1.0 | -1 | -1 | [] | 5.468868 | 5.472851 | 5.477188 | 77 | 53 | 5.472969 |
| 13686 | 13686 | 14732 | test_task_logo_dataset/ebff7b75-f7b6-459e-b282... | 1 | -1.0 | -1 | -1 | [('', (0, 0, 175, 108))] | 5.490151 | 5.453237 | 5.486332 | 108 | 175 | 5.476573 |
322 rows × 14 columns
df[df.ocr.apply(lambda s: len(s) > 0)]  # ocr holds parsed lists after literal_eval, so test for non-empty rather than the string '[]'
|   | Unnamed: 0 | group | path | n_images | api_score | api_name | weak_label | ocr | entropy_r | entropy_g | entropy_b | h | w | entropy |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 30505 | 30505 | 34268 | test_task_logo_dataset/30230c74-6a4e-4a41-bd01... | 1 | -1.0 | -1 | -1 | [('fi', (70, 18, 18, 26))] | 1.064124 | 0.856394 | 1.064124 | 62 | 188 | 0.994881 |
| 19890 | 19890 | 21762 | test_task_logo_dataset/3cb6a4fb-a5dc-43da-871b... | 1 | -1.0 | -1 | -1 | [('10%', (17, 19, 23, 9))] | 1.119453 | 1.119453 | 1.119453 | 39 | 60 | 1.119453 |
| 79 | 79 | 82 | test_task_logo_dataset/df739121-944d-45e4-83bd... | 1 | -1.0 | -1 | -1 | [('»', (51, 27, 45, 52))] | 1.243734 | 1.226790 | 1.191934 | 108 | 141 | 1.220819 |
| 27800 | 27800 | 31043 | test_task_logo_dataset/8dc57f2f-975d-4afd-b490... | 1 | -1.0 | -1 | -1 | [(' ', (44, 0, 363, 134)), (' ', (445, 110... | 0.989292 | 1.441755 | 1.651524 | 266 | 454 | 1.360857 |
| 6480 | 6480 | 6848 | test_task_logo_dataset/e18507a0-c408-428d-b844... | 1 | -1.0 | -1 | -1 | [('S', (0, 16, 93, 92))] | 1.058326 | 1.293013 | 1.794793 | 108 | 93 | 1.382044 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 6236 | 6236 | 6584 | test_task_logo_dataset/621ec1f0-392c-4041-acaf... | 1 | -1.0 | -1 | -1 | [('', (0, 0, 132, 42))] | 5.423659 | 5.479705 | 5.424971 | 42 | 132 | 5.442778 |
| 22497 | 22497 | 24793 | test_task_logo_dataset/073343ce-d29a-4b24-af6f... | 1 | -1.0 | -1 | -1 | [('', (0, 0, 132, 162))] | 5.442437 | 5.465940 | 5.432988 | 162 | 132 | 5.447122 |
| 16995 | 16995 | 18444 | test_task_logo_dataset/e43354f0-407f-4ab4-af9a... | 1 | -1.0 | -1 | -1 | [('', (0, 0, 120, 115))] | 5.468872 | 5.460965 | 5.431633 | 115 | 120 | 5.453823 |
| 12088 | 12088 | 12960 | test_task_logo_dataset/b48faaad-48a5-479b-a38a... | 1 | -1.0 | -1 | -1 | [('', (0, 0, 0, 607))] | 5.388961 | 5.464780 | 5.526125 | 607 | 1151 | 5.459955 |
| 13686 | 13686 | 14732 | test_task_logo_dataset/ebff7b75-f7b6-459e-b282... | 1 | -1.0 | -1 | -1 | [('', (0, 0, 175, 108))] | 5.490151 | 5.453237 | 5.486332 | 108 | 175 | 5.476573 |
10727 rows × 14 columns
df[df.api_score > 0.7]
|   | Unnamed: 0 | group | path | n_images | api_score | api_name | weak_label | ocr | entropy_r | entropy_g | entropy_b | h | w | entropy |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 31977 | 31977 | 36039 | test_task_logo_dataset/0a753f19-5431-4ba5-bcce... | 1 | 0.829048 | Secretlab | -1 | [] | 1.761039 | 1.761039 | 1.761039 | 73 | 78 | 1.761039 |
| 31722 | 31722 | 35721 | test_task_logo_dataset/ed09627d-b399-4be5-ad72... | 1 | 0.857459 | Czech Radio | -1 | [] | 1.998935 | 1.998935 | 1.998935 | 56 | 64 | 1.998935 |
| 31585 | 31585 | 35561 | test_task_logo_dataset/525037fe-82ff-48cb-a0a8... | 1 | 0.737849 | Gas Safe Register | -1 | [(Sat, (18, 16, 28, 15)), (, (0, 44, 61, 1))] | 1.918528 | 1.954471 | 2.168298 | 45 | 61 | 2.013766 |
| 31623 | 31623 | 35606 | test_task_logo_dataset/1cb8e17e-1c6a-49c4-b9fc... | 1 | 0.968081 | Visa Inc. | -1 | [] | 2.149106 | 2.084672 | 2.216203 | 401 | 591 | 2.149993 |
| 31999 | 31999 | 36064 | test_task_logo_dataset/d46beed5-a44e-4f6e-a75f... | 1 | 0.944741 | Santander Bank | -1 | [(, (11, 0, 153, 52))] | 2.106873 | 2.641687 | 2.798778 | 56 | 169 | 2.515779 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 31431 | 31431 | 35372 | test_task_logo_dataset/e48a03d7-0e65-4fb8-9a74... | 1 | 0.963498 | Ben & Jerry's | -1 | [(ate., (63, 80, 45, 19))] | 5.294412 | 5.252222 | 5.298975 | 99 | 332 | 5.281870 |
| 31248 | 31248 | 35153 | test_task_logo_dataset/b6af3a8d-89bb-4c6a-a254... | 1 | 0.819126 | Kit Kat | -1 | [] | 5.383739 | 5.258531 | 5.247514 | 80 | 130 | 5.296594 |
| 31982 | 31982 | 36045 | test_task_logo_dataset/f91b38ce-0c84-47d5-97e4... | 1 | 0.809999 | Castrol | -1 | [] | 5.361373 | 5.329003 | 5.215405 | 34 | 99 | 5.301927 |
| 31787 | 31787 | 35803 | test_task_logo_dataset/25cf8da6-a618-4354-9416... | 1 | 0.751576 | La Liga | -1 | [(, (0, 0, 60, 74))] | 5.380603 | 5.319126 | 5.213046 | 74 | 60 | 5.304258 |
| 31283 | 31283 | 35198 | test_task_logo_dataset/c0151821-4801-4c41-9ba7... | 1 | 0.962751 | Toyota | -1 | [] | 5.346213 | 5.321667 | 5.352011 | 77 | 74 | 5.339964 |
229 rows × 14 columns
q = df.entropy.quantile(q=0.05)
def generate_weak_label(row, q):
    # negative class if the API saw no logo or the image entropy is in the lowest 5%;
    # positive class if the image occurs more than once; unknown (-1) otherwise
    if row["api_name"] == "no_logo" or row["entropy"] < q:
        return 0
    elif row["n_images"] > 1:
        return 1
    else:
        return -1
df["weak_label_entropy"] = df.apply(lambda x: generate_weak_label(x, q), axis=1)
df.to_csv("dataset_with_weak_label_entropy.csv", index=False)
df.groupby("weak_label_entropy").count()
| weak_label_entropy | count |
|---|---|
| -1 | 28443 |
| 0 | 1961 |
| 1 | 1754 |
labeled = pd.read_csv("datasets/labeled_part.csv")
weak = pd.read_csv("datasets/dataset_with_weak_label_entropy.csv")
# manual labels (the first 1000 annotated images) take priority over weak labels
weak = labeled[:1000].merge(weak, on="path", how="right")
weak["mixed_label"] = weak.apply(lambda x: 1 if x.label == "logo" else 0 if x.label == "no_logo" else x.weak_label_entropy, axis=1)
weak.to_csv("datasets/mixed.csv", index=False)
display_images(df[df.weak_label_entropy == 1], 7, 7)
df[(df.entropy > df.entropy.quantile(q=0.45)) & (df.entropy < df.entropy.quantile(q=0.55))]
|   | Unnamed: 0 | group | path | n_images | api_score | api_name | weak_label | ocr | entropy_r | entropy_g | entropy_b | h | w | entropy |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 23493 | 23493 | 25960 | test_task_logo_dataset/65673628-3f99-4c43-bae5... | 1 | -1.0 | -1 | -1 | [] | 4.375413 | 4.397909 | 4.515270 | 49 | 41 | 4.429531 |
| 14169 | 14169 | 15258 | test_task_logo_dataset/699435a4-ee63-44b4-8114... | 1 | -1.0 | -1 | -1 | [] | 4.383097 | 4.465817 | 4.439695 | 75 | 37 | 4.429536 |
| 3176 | 3176 | 3316 | test_task_logo_dataset/d8d5a233-a40c-4acf-b185... | 1 | -1.0 | -1 | -1 | [(Close, (16, 0, 169, 75))] | 3.937051 | 4.578558 | 4.773156 | 83 | 189 | 4.429588 |
| 6697 | 6697 | 7075 | test_task_logo_dataset/05ab2601-bf4e-4e3e-b39c... | 1 | -1.0 | -1 | -1 | [] | 4.487521 | 4.442410 | 4.359000 | 34 | 131 | 4.429643 |
| 24572 | 24572 | 27259 | test_task_logo_dataset/e22bbcae-591f-4239-9176... | 1 | -1.0 | -1 | -1 | [] | 4.411517 | 4.665574 | 4.212026 | 54 | 47 | 4.429706 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 26394 | 26394 | 29392 | test_task_logo_dataset/c7b1f5bd-5b2f-4a73-a546... | 1 | -1.0 | -1 | -1 | [] | 4.593893 | 4.546759 | 4.634288 | 49 | 125 | 4.591646 |
| 30245 | 30245 | 33957 | test_task_logo_dataset/a9deee99-1dd8-4a8a-8ae5... | 1 | -1.0 | -1 | -1 | [] | 4.943807 | 4.452978 | 4.378289 | 144 | 189 | 4.591691 |
| 11880 | 11880 | 12735 | test_task_logo_dataset/79cc327d-d509-43ec-a88e... | 1 | -1.0 | -1 | -1 | [] | 4.627928 | 4.608225 | 4.539067 | 40 | 43 | 4.591740 |
| 10630 | 10630 | 11369 | test_task_logo_dataset/0253be37-8a81-42b6-add4... | 1 | -1.0 | -1 | -1 | [(, (0, 14, 206, 142))] | 4.457937 | 4.537969 | 4.779414 | 156 | 206 | 4.591773 |
| 1879 | 1879 | 1957 | test_task_logo_dataset/f701374c-236a-41d3-abfc... | 1 | -1.0 | -1 | -1 | [] | 4.659987 | 4.553981 | 4.561449 | 107 | 143 | 4.591805 |
3216 rows × 14 columns
display_images(df[df.n_images > 1], 5, 5)
display_images(df[(df.entropy > df.entropy.quantile(q=0.95)) & (df.ocr.apply(lambda s: len(s) > 0))], 5, 5)
display_images(df[df.entropy < df.entropy.quantile(q=0.05)], 7, 7)
df[df.weak_label > -1].groupby("weak_label").count()
| weak_label | count |
|---|---|
| 0 | 369 |
| 1 | 1873 |
img = Image.open(df.sample(n=1).path.values[0])
img
def plot_text(img, min_conf):
    """
    Extracts text regions from an image using Tesseract OCR
    Parameters
    ----------
    img : PIL.Image or np.ndarray
        Input image
    min_conf : int
        Minimum localization confidence required
        to keep a text region
    Returns
    -------
    outputs : list
        List of (text, (x, y, w, h)) tuples, one per
        text region above the confidence threshold
    """
    results = pytesseract.image_to_data(img, output_type=Output.DICT)
    outputs = []
    # loop over each of the individual text localizations
    for i in range(len(results["text"])):
        # bounding box coordinates of the current text region
        x = results["left"][i]
        y = results["top"][i]
        w = results["width"][i]
        h = results["height"][i]
        # the OCR text itself and the localization confidence
        text = results["text"][i]
        conf = int(float(results["conf"][i]))  # conf may come back as a float string
        if conf > min_conf:
            outputs.append((text, (x, y, w, h)))
    return outputs
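A usage sketch on the randomly sampled image above; the min_conf value here is an assumption:
# extract text boxes from the sampled image; min_conf=50 is an assumed cutoff
boxes = plot_text(img, min_conf=50)
print(boxes)  # e.g. [('Close', (16, 0, 169, 75))] - the same format as the ocr column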